Abstract

We provide a mean-field analysis of community structure of social
and biological networks assuming that actors are able to evaluate some
tree-derived distance to the other actors and tend to aggregate with
the least distant. We show that such networks have small components,
and give exact descriptions for the probability distribution of a typical
community size and the number of communities. In particular, we
show that the probability distribution of the community size is well
approximated by a power-law distribution with exponent two. We
illustrate the robustness of the mean-field analysis by comparing its
predictions on previously studied social networks and biological data.

Key-words: Community structure - Random trees - Coalescence - Dis- 
tributional recursions - power laws - kin networks - kin selection. 
1 Introduction

Social networks have recently emerged as a paradigm of the complexity of 
human or animal interactions (Newman, 2003; Wasserman and Faust, 1994; 
Franck, 1998; Scott, 2000). Such networks are sets of actors with some pat- 
tern of contacts or interactions between pairs represented as edges in a graph. 
It is widely assumed that most social networks show community structure, 
i.e., groups of strongly connected vertices, with few connections between 
groups (Girvan and Newman, 2002). Community structure gives rise to a
hierarchy of nested social relationships, which in turn can be thought of as 
a special kind of binary tree called a dendrogram (e.g. Guimera et al., 2003; 
Arenas et al., 2004). 

Algorithms that seek community structure in graphs often attempt to 
reconstruct such a tree, and those that do so generally fall in two main 
categories: hierarchical clustering and edge removal (Scott, 2000; Girvan 
and Newman, 2002; Newman, 2004; Radicchi et al., 2004). In this study, 
we adopt a slightly different perspective on social networks, in which the 
network itself derives from a hierarchical process represented as a tree. The 
novelty is that the trees are considered as unobserved/hidden data, and there 
is no attempt to reconstruct them. Instead, the trees are viewed as random 
objects which enable us to make predictions about the shape of the observed 
community structure. 

The network formation requires the actors to be able to assess
a (perhaps subjective) distance deduced from the tree (see Boguñá, 2004 and
references therein for similar postulates). Such distances are sometimes called
ultrametric. The network then evolves through the preferential attachment of
each actor to the subset of actors that are least distant from her. Here we
present a mean-field analysis of community structure under this model. More
specifically, we describe the probability distributions of a typical community
size and of the number of communities in the network.

In Section 2, we give a description of the mean-field theory for tree-derived
networks, and show that the quantities of interest are involved in recursive
distributional equations. In Section 3, we prove that the networks have small
components, with the community size depending logarithmically on the network
size, while the number of communities depends linearly on the network size.
Then we study a variant of the model with additional clustering, and obtain a
number of useful extensions of the previous results. Section 4 illustrates and
tests the robustness of the mean-field theory on two lists of examples, one
from the social network literature, and the second from the sociobiology and
ecology literature.

2 Mean-field models 

Trees. Formulating the basic principles of social network formation is a highly
difficult task. There is a long tradition in sociology of extracting community
structure from a general network by cluster analysis (Scott, 2000; Newman,
2003). This method assumes a hierarchical organization of the network based
on pair similarities (or distances). Cluster analysis can generally be
represented by a binary tree structure (a dendrogram). Starting with n vertices
and no edges, one adds one edge between the pair with the strongest
similarity. Then the two vertices are aggregated, and the distances to the
remaining vertices are recalculated. The process is iterated until all vertices
are aggregated. The connection between social networks and trees is also
exploited in reconstruction algorithms that progressively remove edges from
the network (Girvan and Newman, 2003).

In this study we assume that the network derives from a tree. The tree
has internal branches that link the internal nodes to the root, and external
branches that start from the tips. The network is formed by going backward
along the external branches of the tree until a first ancestor is met. Edges are
then drawn from each tip to all the descendants of the ancestor obtained in
this process. We call this type of construction a kin network by analogy with
biological networks where the tree represents a common genealogy. This
construction is actually inspired by a biological process called kin recognition,
in which related individuals can recognize their kin and attach preferentially
to their closest relatives. This process forms the basis for the evolution of
altruism (Hamilton, 1964). In the sequel this model is also referred to as
the perfect clustering model, in contrast with an imperfect clustering model
presented afterwards.

Proposing models of random interactions between actors that use learning
and rationality to evolve the structure is a standard approach in sociology and
sociobiology (see Skyrms and Pemantle, 2000 and references therein). Because
it is more amenable to analysis, we consider a mean-field approximation of
these interactions through the underlying tree process. In the mean-field
approximation, the tree is random. It starts with n tips (the actors), adds
one edge between a randomly chosen pair, and then coalesces the two tips into
an ancestor. The same step is then repeated on the remaining lineages until a
single ancestor is left. This model is often called a coalescent tree (see
Aldous, 1999), and arises as a robust approximation of the neutral genealogical
process in population genetics (Kingman, 1982). In analogy with studies of
genetic polymorphism, we never attempt to reconstruct the tree. The coalescent
model is used as a basis for analyzing data such as community sizes or number
of communities in a network. Still in analogy with population genetics, the
coalescent tree may also serve as a model for testing the null hypothesis that
social networks evolve under random/neutral interactions.
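The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `coalescent_communities`, the list-of-tips representation of lineages and the union-find bookkeeping are choices made here, not part of the original study. Pairs of lineages are merged uniformly at random, each tip remembers the clade below its first ancestor, and communities are taken to be the connected components of the resulting kin network.

```python
import random

def coalescent_communities(n, rng=random):
    """Simulate a coalescent tree on n tips and return the kin-network communities."""
    lineages = [[i] for i in range(n)]   # each active lineage lists the tips it subtends
    parent_clade = {}                    # tip -> tips below the first ancestor of that tip
    while len(lineages) > 1:
        i, j = rng.sample(range(len(lineages)), 2)   # randomly chosen pair
        a, b = lineages[i], lineages[j]
        merged = a + b
        for child in (a, b):
            if len(child) == 1:          # an external branch ends at this ancestor
                parent_clade[child[0]] = merged
        lineages = [c for k, c in enumerate(lineages) if k not in (i, j)]
        lineages.append(merged)

    # Union-find: every tip is linked to all tips of its parent clade.
    root = list(range(n))

    def find(u):
        while root[u] != u:
            root[u] = root[root[u]]
            u = root[u]
        return u

    for tip, clade in parent_clade.items():
        for other in clade:
            root[find(other)] = find(tip)

    groups = {}
    for tip in range(n):
        groups.setdefault(find(tip), []).append(tip)
    return sorted(groups.values())
```

Calling `coalescent_communities(12, random.Random(1))` returns a list of tip lists, one per community; the number and the sizes of these lists are the quantities studied in the rest of the paper.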

Recursive definition of random trees. Random coalescent trees share
the same topology as other well-studied branching processes (Yule, 1924;
Harding, 1971; Aldous, 2001). Considering n tips, these trees have the
particular property that the size $L_n$ of the left sister clade at the basal
split of the tree has the uniform distribution over the set $\{1, \dots, n-1\}$,

$$P(L_n = \ell) = \frac{1}{n-1}, \qquad \ell = 1, \dots, n-1,$$

and this property is also valid within each subtree. From this, Aldous (1996)
proposed a recursive definition of dendrograms through a split distribution,
the distribution of the left sister clade given the size of the parent clade. The
connection between random trees and recursive structures has been exploited
by Blum and François (2005a) to prove results about minimal clades in the
neutral coalescent. Their results can be rephrased to say that the outdegree
$\mathrm{out}_n$ of an arbitrary vertex in a network with perfect clustering has
a power-law distribution with exponent $\alpha = 3$. More precisely we have

$$P[\mathrm{out}_n = x] = \frac{4}{x(x+1)(x+2)}, \qquad x = 1, \dots, n-2,$$

and $P[\mathrm{out}_n = n-1] = 2/(n(n-1))$, where n is the network size. Power laws
are not surprising in this context since this parallels similar results for
(perhaps undirected) networks with incremental construction such as the
Albert-Barabási model (Albert and Barabási, 2002) or the Price model (Price, 1965).
See also (Newman, 2003) and (Durrett, 2006).
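As a quick consistency check, the outdegree probabilities can be summed exactly with rational arithmetic. The sketch below assumes that the numerator of the power-law term is 4, which is the value that makes the probabilities sum to one together with the boundary term $2/(n(n-1))$; the function name `outdegree_pmf` is hypothetical.

```python
from fractions import Fraction

def outdegree_pmf(n):
    """Exact outdegree distribution of a vertex in a kin network of size n."""
    # Power-law part, x = 1, ..., n - 2 (numerator 4 is an assumption, see lead-in).
    pmf = {x: Fraction(4, x * (x + 1) * (x + 2)) for x in range(1, n - 1)}
    # Boundary term: the tip is attached directly to the root of the tree.
    pmf[n - 1] = Fraction(2, n * (n - 1))
    return pmf
```

Summing `outdegree_pmf(n).values()` gives exactly 1 for the network sizes tried here, because the power-law part telescopes to $1 - 2/(n(n-1))$.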

Imperfect clustering. Communities in real social networks may sometimes
consist of two or several subcommunities, whereas this property is partly
missing in the perfect clustering model, for which each community is a clade
subtended by an edge that connects a tip. Therefore we consider a modification
of the basic model that tolerates imperfect clustering without modifying
the underlying tree model. In the imperfect clustering model, communities
may sometimes arise from the random coalescence of two previously formed
clusters in addition to those created from the perfect clustering process. We
also assume that the random clustering events occur during the construction
process at a rate p, called the clustering rate (see Fig. 1).

Distributional recursions. In this work the recursive definition of random
trees is exploited to study the mathematical properties of community
structure under the mean-field model. This is done by using distributional
recursions. We call a typical community the network cluster that contains
the leftmost tip in the underlying tree (the tip labelled 1). Because in the
mean-field model the n actors play exchangeable roles, studying the leftmost
actor's community amounts to studying an arbitrary community. We denote
the community size by $S_n$, where n is the total network size. Obviously, we
have $S_2 = 2$ and $S_3 = 3$. To give a recursive definition of $S_n$ (and then
forget the tree), let us split the tree at the root so that two sister clades of
sizes $L_n$ and $R_n = n - L_n$ are obtained, and let $I_n = \min(L_n, R_n)$. The
community size $S_n$ can be recursively defined by $S_n = n$ if $I_n = 1$, otherwise
$S_n = S_{L_n}$. In this definition, the replicates of $L_n$ are recursively sampled
from the uniform distribution. The above set of recursive equations basically
translates the idea of self-similarity and the scale-free property for a typical
community size, but it also provides us with an efficient simulation algorithm
for the probability distribution of $S_n$ that avoids the simulation of the tree
itself. Sets of recursive distributional equations such as those described here
also appear in computer science and are natural in the analysis of random
divide-and-conquer algorithms (Rösler, 2001; Hwang and Neininger, 2002;
Blum and François, 2005b). Regarding the number of communities $N_n$, we
have $N_2 = N_3 = 1$. Like $S_n$, $N_n$ is involved in a set of recursive
distributional equations. The number of communities can actually be defined as
$N_n = 1$ if $I_n = 1$, and otherwise $N_n = N_{L_n} + N'_{R_n}$, where $N'_n$ denotes
an independent copy of $N_n$. Again the above recursive equations provide an
efficient simulation algorithm for the probability distribution of $N_n$.
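The two recursions above translate directly into a sampler that never builds the tree. The following is a minimal sketch, assuming n >= 2 and using hypothetical function names; it draws the split size uniformly on {1, ..., n-1} at each step, exactly as in the recursive definition.

```python
import random

def sample_community_size(n, rng=random):
    """Draw S_n: follow the left clade until a split isolates a single tip."""
    while True:
        left = rng.randint(1, n - 1)        # L_n uniform on {1, ..., n-1}
        if min(left, n - left) == 1:        # I_n = 1: the whole clade is one community
            return n
        n = left                            # otherwise S_n = S_{L_n}

def sample_number_of_communities(n, rng=random):
    """Draw N_n = N_{L_n} + N'_{R_n}, with N_n = 1 as soon as I_n = 1."""
    left = rng.randint(1, n - 1)
    if min(left, n - left) == 1:
        return 1
    return (sample_number_of_communities(left, rng)
            + sample_number_of_communities(n - left, rng))
```

Repeated calls such as `[sample_community_size(100) for _ in range(10000)]` give an empirical approximation of the distribution of the community size; the recursive version for the number of communities makes two independent calls, one per sister clade.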

Turning to the model with imperfect clustering, the equations for the
community size $S_n$ change as follows. We now have $S_n = n$ if $I_n = 1$,
otherwise

$$S_n = \begin{cases} n & \text{with probability } p, \\ S_{L_n} & \text{with probability } q, \end{cases}$$

where $q = 1 - p$. Regarding the number of communities, we have $N_n = 1$ if
$I_n = 1$, and otherwise

$$N_n = \begin{cases} 1 & \text{with probability } p, \\ N_{L_n} + N'_{R_n} & \text{with probability } q. \end{cases}$$

Community structure in the mean-field model and recursive computations
for $N_n$ are illustrated in Fig. 1, where an example with n = 12 vertices and
perfect and imperfect clustering is presented.
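Under imperfect clustering the only change to a simulation is an extra coin flip at each split. The sketch below is illustrative (the function names and the use of Python's `random` module are assumptions made here): with probability p a split with two non-trivial clades is merged into a single community, and with p = 0 the perfect clustering model is recovered.

```python
import random

def sample_size_imperfect(n, p, rng=random):
    """Draw S_n when clustering events occur at rate p (requires n >= 2)."""
    while True:
        left = rng.randint(1, n - 1)
        # Either I_n = 1, or a random clustering event merges the two clades.
        if min(left, n - left) == 1 or rng.random() < p:
            return n
        n = left

def sample_count_imperfect(n, p, rng=random):
    """Draw N_n when clustering events occur at rate p (requires n >= 2)."""
    left = rng.randint(1, n - 1)
    if min(left, n - left) == 1 or rng.random() < p:
        return 1
    return (sample_count_imperfect(left, p, rng)
            + sample_count_imperfect(n - left, p, rng))
```

With p = 1 every network collapses into a single community, which gives a simple sanity check of the implementation.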



3 Community size and the number of communities

3.1 Typical community size 

Probability distribution. We consider the probability distribution of $S_n$,
and denote it by $p_n(x) = P(S_n = x)$ for all $2 \le x \le n$. Then, for large n,
we have

$$p_n(2) \sim \frac{e-2}{e},$$

and for all $x \ge 3$, we have

$$p_n(x) \sim (-1)^{x+1}\, \frac{2\,x!}{x-1} \left( \frac{1}{e} - \epsilon(x) \right), \quad \text{as } n \to \infty, \qquad (1)$$

where $\epsilon(x) = \sum_{k=2}^{x} (-1)^k / k!$ is the exponential sum function
$e_x(z)$ at the point $z = -1$. In particular, for $x = 3$, we obtain that
$p_n(3) \sim 2(3-e)/e$, etc.

The key argument for obtaining the above probability distribution is the
use of the recursive equations defining $S_n$. From the formula of conditional
probabilities, we see that the $p_n(x)$'s are involved in sets of recursions of
the following form

$$p_{n+1}(x) = \left(1 - \frac{1}{n}\right) p_n(x) + \frac{1}{n}\, p_{n-1}(x)$$

for $n \ge x + 1$. For $x = 2$, the initial values are $p_2(2) = 1$ and $p_3(2) = 0$.
For $x \ge 3$, the recursions start from $p_x(x) = 2/(x-1)$ and $p_{x+1}(x) = 0$.
Figure 1: Recursive computations of $N_n$ for n = 12. (A) Perfect clustering:
the network has 4 communities. (B) Imperfect clustering: Two clustering 
events occur and are symbolized by circles. The network has 3 communities. 
Letters at the tips stand for community labels. 



The proof of the result stated in Eq. (1) follows from considerations about
the differences $u_{n+1} = p_{n+1} - p_n$ and elementary calculations. Numerical
computations show that the approximations given by Eq. (1) are accurate for
n > 20. For large x and n, there is a perfect agreement with a power-law
distribution of exponent $\alpha = 2$,

$$p(x) \sim \frac{2}{(x+1)(x-1)}, \quad \text{as } x \to \infty.$$

Numerical computations (not reported) show that the large n - large x
approximations are accurate for x > 25 and n > 100.
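The two-term recursion for the community size probabilities is easy to iterate numerically, which gives a way to check the limiting values. The sketch below uses a hypothetical function name and assumes the recursion p_{n+1}(x) = (1 - 1/n) p_n(x) + p_{n-1}(x)/n with the initial values stated in the text; the limit 1 - 2/e used for x = 2 is the value this recursion yields numerically.

```python
import math

def community_size_probability(x, n):
    """Return p_n(x) = P(S_n = x) by iterating the two-term recursion."""
    if n < x:
        return 0.0
    prev = 1.0 if x == 2 else 2.0 / (x - 1)   # p_x(x)
    if n == x:
        return prev
    curr = 0.0                                 # p_{x+1}(x)
    for m in range(x + 1, n):                  # step from (p_{m-1}, p_m) to p_{m+1}
        prev, curr = curr, (1 - 1 / m) * curr + prev / m
    return curr
```

For instance, `community_size_probability(3, 200)` can be compared with `2 * (3 - math.e) / math.e`, and `community_size_probability(40, 400)` with the power-law value `2 / (41 * 39)`.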



Expected community size. The above result suggests that under the
mean-field model with perfect clustering, the size of a community grows to
infinity as the number of actors increases. Here we give a more precise result,
which states that the growth is in fact very slow. Considering the expected
size, we denote $s_n = E[S_n]$, and obtain that

$$s_n \sim 2 \log(n), \quad \text{as } n \to \infty.$$

In other words the kin networks studied here have small components. The
sets of recursions for $s_n$ are similar to those obtained for $p_n$. Actually we
have

$$s_{n+1} = \left(1 - \frac{1}{n}\right) s_n + \frac{s_{n-1}}{n} + \frac{2}{n}, \qquad n \ge 3,$$

and the initial values are $s_2 = 2$ and $s_3 = 3$. The equation involving the
differences $u_n = s_{n+1} - s_n$ can also be solved, and leads to
$u_n = 2A_{n-1}/n!$, where $A_{n-1}$ is the alternating factorial sum (Sloane's
sequence number A005165 in the EIS). For large n, we obtain that $u_n \sim 2/n$
using (Abramowitz and Stegun 1970), and this leads us to the conclusion that
$s_n \sim 2(\gamma + \log(n))$ with $\gamma$ the Euler constant. To establish
comparisons with the sequence $2\log n$, we find that $s_n/\log n \approx 2.04$
for n = 1000, and $s_n/\log n \approx 2.02$ for n = 100,000.
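The slow growth of the expected community size can be checked by iterating the recursion for the expected value. This is a minimal sketch with hypothetical function names, assuming the recursion s_{n+1} = (1 - 1/n) s_n + s_{n-1}/n + 2/n for n >= 3 with initial values 2 and 3.

```python
import math

def expected_community_size(n):
    """Return s_n = E[S_n] under perfect clustering (n >= 2)."""
    if n == 2:
        return 2.0
    prev, curr = 2.0, 3.0                      # s_2, s_3
    for m in range(3, n):
        prev, curr = curr, (1 - 1 / m) * curr + prev / m + 2 / m
    return curr

def size_to_log_ratio(n):
    """Ratio s_n / log(n), to be compared with the limiting value 2."""
    return expected_community_size(n) / math.log(n)
```

Evaluating `size_to_log_ratio` for increasing n shows how slowly the ratio approaches 2.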

More generally, if we let $s_n^k = E[S_n^k]$, $k \ge 1$, denote the kth moment of
the community size $S_n$, then for large n and $k \ge 2$, we obtain that

$$s_n^k \sim \frac{2k}{k-1}\, n^{k-1}, \quad \text{as } n \to \infty.$$

In particular we have $s_n^2 \sim 4n$, and the variance of $S_n$ grows as $4n$.



3.2 Number of communities 

Probability distribution of the number of communities. The probability
distribution of $N_n$ can be computed exactly by solving a triangular system.
If we let $\pi_n(x) = P(N_n = x)$ for all integers x, we have
$\pi_n(1) = 2/(n-1)$ for $n \ge 3$, and

$$\pi_n(x) = \frac{1}{n-1} \sum_{\ell=2}^{n-2} \sum_{y=1}^{x-1} \pi_\ell(y)\, \pi_{n-\ell}(x-y), \qquad 1 < x \le \lfloor n/2 \rfloor.$$

Expected number of communities. The expected number of communities
is proportional to the number of vertices in the network,

$$e_n = E[N_n] \sim c\,n, \qquad c = \frac{1 - e^{-2}}{4} = 0.216\ldots$$

for large n. From the recursive definition and a basic use of conditional
probabilities, we obtain sets of recursive equations for all the moments of
$N_n$. In particular, the expected number of communities satisfies the following
recursion

$$e_{n+1} = \left(1 - \frac{1}{n}\right) e_n + \frac{2}{n}\, e_{n-1}, \qquad n \ge 3,$$

where the initial values are $e_2 = e_3 = 1$. In the appendix, we show that this
leads to

$$e_n = (1 - e^{-2})(n+2)/4 + O(2^n/(n-1)!).$$
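The agreement between the recursion for the expected number of communities and its closed form can be verified numerically. The sketch below uses hypothetical function names and assumes the recursion e_{n+1} = (1 - 1/n) e_n + (2/n) e_{n-1} with both initial values equal to 1; since the remainder term decays factorially, the two values should coincide to machine precision for moderate n.

```python
import math

def expected_number_of_communities(n):
    """Return e_n = E[N_n] under perfect clustering (n >= 2)."""
    if n <= 3:
        return 1.0
    prev, curr = 1.0, 1.0                      # e_2, e_3
    for m in range(3, n):
        prev, curr = curr, (1 - 1 / m) * curr + 2 * prev / m
    return curr

def linear_closed_form(n):
    """Leading term (1 - e^{-2})(n + 2)/4 of the expected number of communities."""
    return (1 - math.exp(-2)) * (n + 2) / 4
```

Comparing `expected_number_of_communities(n)` with `linear_closed_form(n)` for n = 4, 5, 6, ... shows the difference alternating in sign and shrinking very quickly.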

Convergence in probability. To obtain a convergence in probability result,
note that $t_n = E[N_n^2]$ solves the recursion

$$t_{n+1} - t_n = -(t_n - t_{n-1})/n + t_{n-1}/n + 2(r_{n+1} - r_n)/n,$$

with the residual term $r_n$ equal to $r_n = \sum_{i=2}^{n-2} e_i e_{n-i}$. Having
proved $e_n \sim cn$ for large n, we can check that the residual difference term
is equivalent to $r_{n+1} - r_n = c^2 n^2/2 + o(n^2)$. This leads to
$E[N_n^2] = c^2 n^2 + o(n^2)$, which can be translated into a convergence in
probability result ($N_n/n \to c$) by a standard application of Chebyshev's
inequality.

3.3 Imperfect clustering 

Assuming imperfect clustering at rate p modifies the recursions for $S_n$ and
$N_n$, and complicates their mathematical analysis. Nevertheless, the main
results can be summarized as follows (see the Appendix for details).
Community size. Assume that the clustering rate is positive, $p > 0$. For
$1 < x \le n$, we let $p_n(x) = P(S_n = x)$. Then, we have for all $n \ge x + 1$

$$p_{n+1}(x) = \left(1 - \frac{1}{n}\right) p_n(x) + \frac{q}{n}\, p_{n-1}(x), \qquad (2)$$

where $p_x(x) = p + 2q/(x-1)$ and $p_{x+1}(x) = 0$.

To discuss the probability distribution of $S_n$ under imperfect clustering,
let us distinguish the case x = 2 from the general case. For x = 2, Eq. (2)
starts with the initial values $p_2(2) = 1$ and $p_3(2) = 0$. We set

$$I(p) = q \int_0^1 y^{-p} e^{-qy} (1-y)^2 \, dy.$$

When n grows to infinity, we obtain that

$$p_n(2) \sim \frac{1}{\Gamma(q)}\, I(p)\, n^{-p}, \quad n \to \infty.$$

For $p \to 0^+$, we have

$$\frac{1}{\Gamma(q)}\, I(p) = \frac{e-2}{e} + a\,p + O(p^2),$$

with $a = e^{-1}(\gamma - 1 + e\,\mathrm{Ei}(1,1)) = 0.0639$ ($\mathrm{Ei}$ is the
exponential integral, see Abramowitz and Stegun, 1970). For $x > 2$, we denote
$f_n = p_{n+x}(x)$, and we have

$$(n + x - 1) f_n = (n + x - 2) f_{n-1} + q f_{n-2}, \qquad (3)$$

where $f_0 = p + 2q/(x-1)$ and $f_1 = 0$. Using notation similar to the above, we
set

$$I_x(p) = q\,(2 + (x-3)p) \int_0^1 y^{-p} e^{-qy} (1-y)^{x} \, dy,$$

and we obtain that, for large n,

$$p_n(x) \sim \frac{1}{(x-1)\,\Gamma(q)}\, I_x(p)\, n^{-p}.$$

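The expected value $s_n = E[S_n]$ solves the following recursion

$$s_{n+1} = \left(1 - \frac{1}{n}\right) s_n + \frac{q}{n}\, s_{n-1} + 2\left(p + \frac{q}{n}\right),$$

where the initial values are $s_2 = 2$ and $s_3 = 3$. The solution satisfies

$$s_n \sim \frac{2p}{1+p}\, n, \quad n \to \infty,$$

for $0 < p < 1$.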
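The linear growth of the expected community size under imperfect clustering can be illustrated by iterating the recursion for the expected value. The sketch below is written under the assumption that the recursion reads s_{n+1} = (1 - 1/n) s_n + (q/n) s_{n-1} + 2(p + q/n), which is the form obtained by conditioning on the first split; the function name is hypothetical.

```python
def expected_size_imperfect(n, p):
    """Return s_n = E[S_n] when clustering events occur at rate p (n >= 2)."""
    q = 1.0 - p
    if n == 2:
        return 2.0
    prev, curr = 2.0, 3.0                      # s_2, s_3
    for m in range(3, n):
        # Assumed recursion: (1 - 1/m) s_m + (q/m) s_{m-1} + 2 (p + q/m)
        prev, curr = curr, (1 - 1 / m) * curr + q * prev / m + 2 * (p + q / m)
    return curr
```

For a fixed rate such as p = 0.3, the ratio `expected_size_imperfect(n, 0.3) / n` can be compared with the slope 2p/(1 + p) as n increases.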
Number of communities. Regarding the number of communities, we
let $\pi_n(x) = P(N_n = x)$. Then the distribution $\pi_n$ can be calculated using
triangular induction as follows

$$\pi_n(1) = p + \frac{2q}{n-1},$$

and

$$\pi_n(x) = \frac{q}{n-1} \sum_{\ell=2}^{n-2} \sum_{y=1}^{x-1} \pi_\ell(y)\, \pi_{n-\ell}(x-y), \qquad 2 \le x \le \lfloor n/2 \rfloor.$$

We can use the above set of recursive equations to compute exact distributions
up to network sizes greater than n = 500. In addition these equations
enable us to obtain maximum likelihood estimates of the clustering rate p
using either a basic grid search or more elaborate dichotomic searches.
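The triangular induction can be sketched as a small dynamic program. The code below is an illustration under stated assumptions: the function names are hypothetical, the double sum is taken with the prefactor q/(n-1), and the maximum likelihood estimate uses a plain grid over [0, 1] rather than a dichotomic search. Setting p = 0 gives the perfect clustering distribution.

```python
def community_count_distribution(n_max, p=0.0):
    """Return pi[n][x] = P(N_n = x) for 2 <= n <= n_max (index 0 of each row is unused)."""
    q = 1.0 - p
    pi = {2: [0.0, 1.0], 3: [0.0, 1.0]}
    for n in range(4, n_max + 1):
        row = [0.0] * (n // 2 + 1)
        row[1] = p + 2 * q / (n - 1)
        for left in range(2, n - 1):           # both sister clades have at least 2 tips
            a, b = pi[left], pi[n - left]
            for y in range(1, len(a)):
                for z in range(1, len(b)):
                    row[y + z] += q * a[y] * b[z] / (n - 1)
        pi[n] = row
    return pi

def p_value(n, observed, p=0.0):
    """One-sided p-value P(N_n <= observed)."""
    return sum(community_count_distribution(n, p)[n][1:observed + 1])

def estimate_clustering_rate(n, observed, grid=101):
    """Grid-search maximum likelihood estimate of p from one observed count."""
    candidates = [k / (grid - 1) for k in range(grid)]
    return max(candidates,
               key=lambda p: community_count_distribution(n, p)[n][observed])
```

For a network of 34 vertices with 2 observed communities, `p_value(34, 2)` and `estimate_clustering_rate(34, 2)` give the kind of quantities reported for the examples of Section 4.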

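The expected number of communities $e_n$ is involved in the following
recursion

$$e_{n+1} = \left(1 - \frac{1}{n}\right) e_n + \frac{2q}{n}\, e_{n-1} + \frac{p}{n}, \qquad n \ge 3,$$

($e_2 = e_3 = 1$), which can be solved. When n grows to infinity, we find that

$$e_n \sim \begin{cases} c(p)\, n^{1-2p} & \text{if } p < \tfrac{1}{2}, \\ \tfrac{1}{2} \log n & \text{if } p = \tfrac{1}{2}, \\ \dfrac{p}{2p-1} & \text{if } p > \tfrac{1}{2}, \end{cases}$$

where $0 < p < 1$, and

$$c(p) = \frac{e^{-2q}}{\Gamma(2q)} \int_0^1 e^{2qt} \left( (1+t)(1-t)^{1-2p} + p\,t^2 (1-t)^{-2p} \right) dt.$$

The exact expression of $c(p)$ is not simple, but we have

$$c(p) \sim c + d\,p + o(p), \quad \text{as } p \to 0,$$

where

$$d = -\mathrm{Ei}(1, 2) + (\gamma - 2)e^{-2} + \log 2 + 1 \approx 0.7747.$$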
All the equations given in Sections 3.1-3.2 can be retrieved for p = 0. We see
that the model undergoes a phase transition at p = 1/2: for p < 1/2 the
network has essentially small components, whereas for p > 1/2 a giant component
may emerge. Examples of the probability distributions described above are
displayed for different network sizes in Figs. 2-4. These graphics are taken
from the real examples discussed in the next section.
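The change of regime at p = 1/2 can be seen numerically by iterating the recursion for the expected number of communities. The following sketch assumes the recursion e_{n+1} = (1 - 1/n) e_n + (2q/n) e_{n-1} + p/n with both initial values equal to 1; the function name is hypothetical.

```python
def expected_count_imperfect(n, p):
    """Return e_n = E[N_n] when clustering events occur at rate p (n >= 2)."""
    q = 1.0 - p
    if n <= 3:
        return 1.0
    prev, curr = 1.0, 1.0                      # e_2, e_3
    for m in range(3, n):
        prev, curr = curr, (1 - 1 / m) * curr + 2 * q * prev / m + p / m
    return curr
```

For p below 1/2 the values grow like a power of n with exponent 1 - 2p, for p = 1/2 they grow like half the logarithm of n, and for p above 1/2 they settle near the constant p/(2p - 1), so that the network is dominated by a bounded number of communities.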

4 Examples 

4.1 Collaboration and friendship networks

Zachary's friendship network. A much-analyzed example of a social
network is a karate club observed over two years by an anthropologist, Wayne
Zachary, in the 1970s (Zachary, 1977). The network of friendships among the
club members has been depicted in a graph by White and Harary (2001).
The "karate club" network of Zachary was studied previously by a number
of other authors in this context (Girvan and Newman, 2002; Zhou, 2003).
During the period of study, the club split in two as a result of a dispute
between two factions, with those closest to the leader (the karate teacher)
following him, and those closest to the administrator forming the other
group. Previous studies have found that the fault lines along which the split
occurred are readily visible in the structure of the network. The network size
is n = 34 (number of club members). Under the mean-field model, we obtained
$P(N_{34} \le 2) = 0.087$, which can be considered as a one-sided p-value. We
computed a one-sided p-value because the model is generally more likely to
underestimate the true community size than to overestimate it (here we have
$E[N_{34}] = 7.1 > 2$). The clustering rate was estimated as p = 0.58. Assuming
an error of p = 1/2 during the network construction, the p-value rose to 0.704.
In this and the next examples, the second p-value can be interpreted as a
type-II error when the perfect clustering model is rejected against an imperfect
clustering model. In the Zachary's club example, the perfect clustering model
is rejected at the significance level $\alpha = 0.087$, but the power of the test
is low (around 0.3).

Figure 2: Probability distribution of $S_{34}$ (left) and $N_{34}$ (right) for
p = 0 (top) and p = 0.5 (bottom) corresponding to Zachary's friendship network.
Data from (Zachary 1977).

American football conferences. An interesting example in Girvan and
Newman (2002) is the network of United States college football, a
representation of the schedule of Division I games for the 2000 season: vertices
in the graph represent teams. It has a known community structure, and the
reconstructed tree published in the original study matches the model presented
here quite well (most communities end with an external branch). In this
example the community structure comes from geographical (and historical)
relationships between colleges. The network has 115 teams and 12 conferences.
Computing the distribution of the number of communities under the
mean-field model, we obtained that $P(N_{115} \le 16) = 0.063$. The estimated
clustering rate was p = 0.2. Assuming an error of p = 0.2 during the network
construction, the p-value increases to $P(N_{115} \le 16) = 0.654$.

Santa-Fe collaboration network. Girvan and Newman (2002) have also 
applied their community-finding method to a collaboration network of scien- 
tists at the Santa Fe Institute, an interdisciplinary research center in Santa 
Fe, New Mexico. The 118 vertices in this network represent the largest com- 
ponent of the collaboration graph among scientists in residence at the Santa 
Fe Institute during any part of calendar year 1999 or 2000 and their collabo- 
rators. An edge is drawn between a pair of scientists if they coauthored one or 
more articles during the same time period. The algorithm splits the network
into six strong communities, which leads us to estimate a large clustering rate
p = 0.36. In this example the neutral model is rejected (p-value = 0.029),
perhaps due to the fact that the algorithm found surprising groupings, and 
the network contains ties between researchers from traditionally disparate 
fields. Girvan and Newman conjectured that this feature may be peculiar to 
interdisciplinary centers like the Santa Fe Institute. 

A remark about the average community size. The ecological or sociobiology
literature is not always as formal as we are regarding the average
community size. The typical community as studied here contains one
prespecified individual. If there are $N_n$ communities in the sample, then the
average community size is generally computed as

$$\text{Average community size} = \sum_{i=1}^{N_n} S_n^i / N_n,$$

where the $S_n^i$ are the sizes of the distinct communities within the sample.
This quantity can be equivalently formulated as $n/N_n$, and its expectation
differs from $s_n$. Among the examples in the previous section, only the football
conferences met the criteria of large size and consistency with the mean-field
model. In this case, we had $n/N_n = 9.58$, which was strikingly close to
$s_{115} \sim 2\log(115) = 9.48$. In general, the bias can be stronger.

Figure 3: Probability distribution of $S_{115}$ (left) and $N_{115}$ (right) for
p = 0 (top) and p = 0.2 (bottom) corresponding to the current American football
conference network. Data from (Girvan and Newman 2002).

4.2 Groups in social animals 

In the wild, social animals (and especially social carnivores) usually live in 
small sized groups. The groups are given different names according to the 
species. For example lions live in prides, dolphins live in pods, or wolves live 
in packs. A process called kin-selection was suggested by Hamilton (1964)
as a mechanism for the evolution of altruistic behavior, and as one of the
mechanisms that may explain the formation of kin-networks in social animal
species (Dawkins, 1989; Forster et al., 2006). The process can be sketched as
follows. Since identical copies of genes may be carried by relatives, a gene
that favors altruism may become successful provided the reproductive
benefit gained by the recipient of the 'altruistic' act compares favorably to the
reproductive cost to the individual performing the act. In this comparison,
the reproductive benefit gained by the recipient is weighted by the
genetic relatedness (r) of the recipient, defined as the percentage of genes
that those two individuals share by common descent. Kin-selection involves
kin recognition at a basic level, and shares similarities with the aggregation 
model presented in this study. For instance, genetic relatedness corresponds 
to a natural measure of closeness between living organisms, and obviously 
derives from a (genealogical) tree. 

In the next paragraphs we compare the mean-field model predictions 
with published data that report precise sample sizes and observed number 
of groups in wolves and lions, where kin-selection is often assumed to be
acting (see e.g., Rodman, 1981). Many workers have proposed the alterna- 
tive idea that the reason social carnivores live in groups, or packs, is because 
group hunting facilitates their acquisition of large prey (Mech 1970; Nudds 
1978; Pulliam and Caraco 1978). However this idea is not shared by all, and 
recent summaries have argued that communal hunting has little power to 
explain group patterns in felids (Packer et al. 1990) and across social car- 
nivores in general (Caro 1994). Beyond this discussion a general consensus 
that kin-selection contributes to the organization and evolution of animal 
social structures remains. 
Wolf packs. Grey wolves (Canis lupus) are pack-living animals with a complex
social organization. Packs are primarily family groups. Packs include
up to 30 individuals, but smaller sizes (8-12) are more common. A review of
wolf social behavior and ecology can be found in Mech (1970). We use data
from three sources: the Wolf Project of Yellowstone National Park, which
annually publishes accurate data on wolf pack sizes (Smith et al., 2002; 2004),
and studies of wolf population recovery after quasi-extinction in Scandinavia
(Wabakken et al., 2001) and in Alaska (Ballard et al., 1987). When available,
the total sample size was given as the number of sampled adults (in wolves
the number of pups per pack is usually small). In 2002, n = 90 adult wolves
were sampled in Yellowstone, living in 14 packs. From the mean-field analysis,
we obtained that $P(N_{90} \le 14) = 0.17$. The clustering rate was p = 0.11.
Table 1 reports similar results for the year 2004. In Alaska, n = 151
wolves were sampled, living in 30 packs (number of pups not known). From
the mean-field analysis, we obtained that $P(N_{151} \le 30) = 0.31$. The
clustering rate was p = 0.03. In Scandinavia, 76 wolves were sampled, living in
12 packs (number of pups not known). From the mean-field analysis, we
obtained that $P(N_{76} \le 16) = 0.18$. The clustering rate was p = 0.12. The last
two p-values may be slight underestimates because the pups were included
in the sample.

Lion prides. African lions (Panthera leo) live in prides that typically consist
of two males, 4-10 females and their offspring. The adult females are usually
related to one another and are group members for life. A review of Serengeti
lion behavior and ecology can be found in Schaller (1972). We use recent
data from three sources: Selous Game Reserve, Tanzania (Spong et al., 2002),
Serengeti, Tanzania (Packer et al. 2005), and Kafue Park, Zambia (Carlson et
al., 2004). The study of social and genetic structure of Selous Game Reserve
lions (Spong et al. 2004) reported the presence of 14 prides, with an average
number of 5.6 adults (range 2-9) and 2 males in each pride. These observations
can be turned into an estimate of 51 females in the sample. A recent
survey of Serengeti lions reports the presence of about one hundred lionesses
in the park (Packer et al. 2005). Based on an average of 6 females per pride,
a number of 16-17 prides in Serengeti is consistent with the current data.
At least 95 adult lions reside in the northern sector of Kafue National Park,
either living in one of 14 prides or roaming as solitary males (Carlson et al.,
2004). Among the adult lions, there are 31 males and 64 females (a sex ratio
of 1:2). Nine of the 14 prides did not have a sexually mature male residing
with them. Pride sizes ranged from 2 to 14 adult animals (mean = 6.4 animals
per pride). Of the 17 sexually mature males that were identified, six
associated with prides of females while 11 lived either alone or in all-male
dyads.

Figure 4: Probability distribution of $S_{90}$ (left) and $N_{90}$ (right) for
p = 0 (top) and p = 0.1 (bottom) corresponding to wolf packs data (Smith et al.
2002).

Table 1 reports results for the three samples. As for wolves, the lion
samples exhibit high p-values and low estimates of the clustering rate p. In 
addition, estimates for Zambia may be biased downward because we may 
have included males. Actual values of female counts would exhibit larger 
p-values, lower clustering rates, and an even stronger agreement with the 
mean-field model. 

5 Discussion 

The mean-field networks presented in this article are rough models of social
aggregation based on a measure of similarity or kinship. While the network
clearly reflects an aggregation process, it is also clear that the underlying
tree model does not account for highly structured interactions.

The results obtained on several real-world networks and biological data
have shown that the mean-field model is sufficiently robust to capture some
essential patterns of group formation, especially if imperfect clustering is
included. In the three examples of social networks, the mean-field model
was however strongly rejected once (Santa Fe), and weakly rejected twice
(Zachary's club and football conferences). Although the tests lack power, the
fact that clustering rates were high suggests that interactions stronger than
random may be shaping these networks.

The situation was different and perhaps more interesting in the social
carnivore examples, for which we observed stronger acceptance of the mean-field
model. One clear conclusion from the results about wolves and lions is that
the mean-field model predicts the number of communities quite well. This
does not contradict the fact that more specific models may better explain
community structure (e.g., Giraldeau and Caraco, 1993). Kin-selection is
acknowledged to be a major factor in the evolution of social structures in
wolves and lions, and kin recognition is believed to happen at the same time.
Although the mean-field model includes kin recognition, it actually neglects
the effects of selection, or assumes that selection has a very weak impact
on the shape of the underlying genealogical process. This idea is consistent
with mathematical studies of selection processes (Neuhauser and Krone, 1997;
Krone and Neuhauser, 1997). Studying population genetics models with
weak selection, Neuhauser and Krone (1997) actually remarked that weak
selection does not modify the neutral coalescent topology significantly.

A. Social networks        network   number of       p-value   rate
                          size      communities               p

Zachary's                 34        2               0.087     0.58
Santa Fe                  118       6               0.029     0.36
Football                  115       12              0.063     0.2

B. Social carnivores      sample    number of       p-value   rate
                          size      packs/prides              p

Yellowstone Wolf 2002     90        14              0.17      0.11
Yellowstone Wolf 2004     112       16              0.12      0.19
Alaska Wolf               151       30              0.31      0.03
Scandinavian Wolf         76        12              0.18      0.12
Zambia Kafue Lions        95        14              0.145     0.12
Selous Game Lions         51        13              0.64      0.00
Serengeti Lions           100       16              0.17      0.10

Table 1: Data on community structure. A. Social networks. B. Wolves and
lions.

Although the wolf and lion data agree with the predictions of the perfect
clustering mean-field model, there are other social species for which the fit
may be poorer. This may be the case for fish schools or large ungulate herds,
where other models of group formation may be more appropriate (e.g., Bonabeau
et al., 1999). For example, in an aerial survey of known and suspected wild
camel (Camelus bactrianus) habitat, Reading et al. (1999) estimated group
density and population size of large ungulates in the south-western Gobi
Desert in Mongolia. They observed 277 wild camels in 27 groups, which
leads to a strong rejection of the mean-field model (p-value = 0.026). The
same is also true for buffaloes, which live in herds much larger than the
community size predicted by the mean-field model.

In summary, we have presented a mean-field analysis of community structure
in tree-derived networks that includes an attachment process to the closest
vertex deduced from the tree. Our model is reasonably simple, and we have
obtained exact results about the typical community sizes and the numbers
of communities. While community structure in the studied social networks has
exhibited weak departures from the perfect clustering mean-field model,
predictions of imperfect clustering models with higher clustering rates are
consistent with the data. This suggests that stronger interactions than random
may be present in these networks. Examples of social animals have provided
a better fit to the mean-field model. In populations evolving altruistic
behavior, the results suggest that kin recognition may contribute to shaping
community structure more significantly than natural selection itself does.

Mathematical Appendix

Results for perfect clustering 

Asymptotics of $p(x)$ for large $x$ (Distribution of $S_n$). We have

\[ e^{-1} = \sum_{k=2}^{\infty} \frac{(-1)^k}{k!} \]

and then

\[ e^{-1} - e(x) = \sum_{k=x+1}^{\infty} \frac{(-1)^k}{k!}. \]

We deduce that

\[ e^{-1} - e(x) = \frac{(-1)^{x+1}}{(x+1)!} + o\!\left(\frac{1}{(x+1)!}\right), \]

and plugging this into the formula for $p(x)$, we obtain

\[ p(x) \sim \frac{2}{(x-1)(x+1)}, \quad \text{as } x \to \infty. \]
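The expansion of $e^{-1} - e(x)$ used in this step can be checked numerically. The following is a minimal sketch, assuming only that the remainder is the alternating series $\sum_{k \ge x+1} (-1)^k/k!$; the function names `tail` and `first_term` and the truncation length are illustrative choices, not part of the original derivation.

```python
from fractions import Fraction
from math import factorial


def tail(x, terms=60):
    """Sum of (-1)^k / k! for k = x+1, ..., x+terms, in exact arithmetic.

    Truncating after `terms` terms is an assumption of this sketch; the
    neglected part is smaller than 1 / (x + terms + 1)!.
    """
    return sum(Fraction((-1) ** k, factorial(k))
               for k in range(x + 1, x + 1 + terms))


def first_term(x):
    """Leading term (-1)^(x+1) / (x+1)! of the remainder."""
    return Fraction((-1) ** (x + 1), factorial(x + 1))


if __name__ == "__main__":
    for x in (5, 20, 80):
        # The ratio should lie between 1 - 1/(x+2) and 1, and tend to 1.
        ratio = tail(x) / first_term(x)
        print(x, float(ratio))
```

The ratio approaches 1 as $x$ grows, which is the statement that the remainder is equivalent to its first term.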



Second moment of $S_n$. Regarding the second order moment
$s_n^{(2)} = E[S_n^2]$, we find that the difference
$u_{n+1} = s_{n+1}^{(2)} - s_n^{(2)}$ satisfies

\[ u_{n+1} = -\frac{1}{n}\bigl(u_n - 2(2n+1)\bigr), \]

which can also be solved, and yields
$u_n \sim (4A_{n+1} - 2A_n)/(n\,n!) \sim 4$. In conclusion we have

\[ s_n^{(2)} \sim 4n. \]



Higher moments of $S_n$. Let us denote by $\phi_n(t) = E[e^{tS_n}]$ the moment
generating function of $S_n$. Then using the formula of conditional
probabilities we obtain a functional recurrence equation for $\phi_n(t)$:

\[ \phi_{n+1}(t) = \left(1 - \frac{1}{n}\right)\phi_n(t) + \frac{1}{n}\phi_{n-1}(t) + \frac{2}{n}e^{tn}(e^t - 1). \tag{4} \]

Now we set $f_n(t) = 2e^{tn}(e^t - 1)/n$, and see that the derivatives of
$f_n(t)$ at $t = 0$ are equal to

\[ f_n^{(k)}(0) = \frac{2(n+1)^{k-1}}{n} + 2\bigl((n+1)^{k-1} - n^{k-1}\bigr) \]

for all $k \ge 1$. The $k$th moment of $S_n$, denoted $s_n^{(k)}$, is equal to
$\phi_n^{(k)}(0)$. Thus it solves the equation

\[ s_{n+1}^{(k)} = \left(1 - \frac{1}{n}\right)s_n^{(k)} + \frac{1}{n}s_{n-1}^{(k)} + f_n^{(k)}(0). \]

If we denote $u_n^{(k)} = s_{n+1}^{(k)} - s_n^{(k)}$, then for all $n \ge 2$,
$k \ge 1$, we have $u_n^{(k)} = -u_{n-1}^{(k)}/n + f_n^{(k)}(0)$. Using Newton's
binomial formula, we check that $f_n^{(k)}(0) \sim 2k\,n^{k-2}$ for large $n$.
So we have $u_n^{(k)} + u_{n-1}^{(k)}/n = f_n^{(k)}(0) \sim 2k\,n^{k-2}$. Thus

\[ u_n^{(k)} \sim 2k\,n^{k-2}. \]

Finally, when $k \ge 2$, we have

\[ s_n^{(k)} \sim \frac{2k}{k-1}\,n^{k-1} \]

as $n$ grows to infinity.
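The moment recurrence can be iterated directly to see the asymptotic regime emerge. The sketch below assumes the recurrence $s_{n+1}^{(k)} = (1 - 1/n)s_n^{(k)} + s_{n-1}^{(k)}/n + (2/n)((n+1)^k - n^k)$ for $n \ge 3$, and it assumes starting values $s_2^{(k)} = 2^k$ and $s_3^{(k)} = 3^k$ (a single community for $n = 2, 3$). These starting values are a hypothesis of the sketch and not taken from the text; the leading asymptotics do not depend on them. The function name `moment_sequence` is illustrative.

```python
def moment_sequence(k, n_max, s2=None, s3=None):
    """Approximate the k-th moment s_{n_max}^{(k)} by iterating the recurrence.

    s_{n+1} = (1 - 1/n) s_n + s_{n-1}/n + (2/n) ((n+1)^k - n^k),  n >= 3.
    The default starting values s_2 = 2^k, s_3 = 3^k are an assumption.
    """
    s_prev = float(2 ** k) if s2 is None else float(s2)
    s_curr = float(3 ** k) if s3 is None else float(s3)
    for n in range(3, n_max):
        # k-th derivative of f_n at 0, computed with exact integers first
        forcing = 2.0 * ((n + 1) ** k - n ** k) / n
        s_prev, s_curr = s_curr, (1 - 1 / n) * s_curr + s_prev / n + forcing
    return s_curr


if __name__ == "__main__":
    n = 20000
    for k in (2, 3, 4):
        predicted = 2 * k / (k - 1) * n ** (k - 1)
        print(k, moment_sequence(k, n) / predicted)
```

For $k = 2$ the printed ratio compares the iterated second moment with $4n$; for larger $k$ it compares with $2k\,n^{k-1}/(k-1)$.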

Mean number of communities. The sequence $f_n$ defined by $f_n = e_{n+2}$
satisfies

\[ (n+1) f_n = n f_{n-1} + 2 f_{n-2} \tag{5} \]

where the recursion is initialized as $f_0 = f_1 = 1$. Since the equation
satisfied by $f_n$ is a linear recurrence equation of order 2, $f_n$ is a
linear combination of two independent solutions of eq. (5). One
straightforward solution of eq. (5) is

\[ u_n = n + 4, \quad n \ge 2, \]

with $u_0 = 4$ and $u_1 = 5$.

The rest of the proof is devoted to finding a solution $v_n$ of eq. (5) with
initial values $v_0 = 0$ and $v_1 = 1$. Let us denote by $h$ the generating
function of $v_n$. Then we have

\[ h(x) = \sum_{n=2}^{\infty} v_n x^n. \]

Using equation (5), we find that $h$ is a solution of the following
differential equation

\[ h(x)(1 - 2x^2 - x) = h'(x)(x^2 - x) + 2x^3 + 2x^2. \]

Solving the above differential equation leads to

\[ h(x) = \frac{e^{-2x} - 1}{(x-1)^2 x} + \frac{-x^4 + 2x^3 - 2x^2 + 2x}{(x-1)^2 x}. \]

Using the Taylor expansion of $h$, we find that

\[ v_n = n + 2 + \sum_{k=0}^{n} \frac{(-2)^{k+1}(n+1-k)}{(k+1)!}. \]

Since $f_n$ is a linear combination of $u_n$ and $v_n$,

\[ f_n = \frac{1}{4}u_n - \frac{1}{4}v_n, \]

we find that the expected value of the number of communities is given by

\[ e_n = \frac{1}{2} - \frac{1}{4}\sum_{k=0}^{n-2} \frac{(-2)^{k+1}(n-1-k)}{(k+1)!}. \]

This can be rewritten as

\[ e_n = \frac{1}{2} - \frac{n-1}{4}\sum_{k=0}^{n-2} \frac{(-2)^{k+1}}{(k+1)!} + \frac{1}{4}\sum_{k=0}^{n-2} \frac{k\,(-2)^{k+1}}{(k+1)!}. \]

Since the remainder (beginning with term $n$) of a convergent alternating
series is dominated by the absolute value of its first omitted term, both
sums can be replaced by their limits, $e^{-2} - 1$ and $1 - 3e^{-2}$, up to an
error of order $2^n/(n-1)!$. For large $n$, this leads to

\[ e_n = \frac{1 - e^{-2}}{4}(n-1) + \frac{3(1 - e^{-2})}{4} + O\!\left(\frac{2^n}{(n-1)!}\right). \]
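The recurrence and its asymptotic solution can be compared numerically. The sketch below assumes the perfect clustering recurrence $(n+1)e_{n+2} = n\,e_{n+1} + 2e_n$ with $e_2 = e_3 = 1$, and compares it with the linear approximation $\frac{1-e^{-2}}{4}(n-1) + \frac{3(1-e^{-2})}{4}$. The function names are illustrative.

```python
from fractions import Fraction
from math import exp


def expected_communities(n_max):
    """Return {n: e_n} for 2 <= n <= n_max, in exact rational arithmetic.

    Uses (n+1) e_{n+2} = n e_{n+1} + 2 e_n with e_2 = e_3 = 1.
    """
    e = {2: Fraction(1), 3: Fraction(1)}
    for n in range(2, n_max - 1):
        e[n + 2] = (n * e[n + 1] + 2 * e[n]) / (n + 1)
    return e


def asymptotic_communities(n):
    """Linear approximation (1 - e^-2)/4 * (n - 1) + 3 (1 - e^-2)/4."""
    c = (1 - exp(-2)) / 4
    return c * (n - 1) + 3 * c


if __name__ == "__main__":
    e = expected_communities(40)
    for n in (4, 8, 16, 40):
        print(n, float(e[n]), asymptotic_communities(n))
```

Because the error term decays factorially, the two columns agree to many digits already for moderate $n$. As an illustration, `asymptotic_communities(277)` evaluates to roughly 60, to be compared with the 27 camel groups mentioned in the Discussion.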



Results for imperfect clustering 

Here the results are presented in reverse order compared to the text. We
first prove results for the mean number of communities (easier), and give a
sketch of the proof for the mean community size.

Mean number of communities. Assuming imperfect clustering at rate $p$, and
writing $q = 1 - p$, the expected number of communities solves the following
equation:

\[ (n+1)e_{n+2} = n\,e_{n+1} + 2q\,e_n + p \tag{6} \]

with the initial values $e_2 = e_3 = 1$. For $p \ne 1/2$ we set
$f_n = e_{n+2} - p/(2p-1)$, and obtain that

\[ (n+1) f_n = n f_{n-1} + 2q f_{n-2}. \tag{7} \]

Let us denote by $h(x)$ the generating function of $f_n$,

\[ h(x) = \sum_{n=2}^{\infty} f_n x^n. \]

We first investigate a solution $f_n$ to equation (7) with $f_0 = 0$ and
$f_1 = 1$. The generating function solves the following differential equation

\[ (1 - x - 2qx^2)\,y = (x^2 - x)\,y' + 2x^2 + 2qx^3. \tag{8} \]

The analytical solution of equation (8) is given by

\[ h(x) = \frac{e^{-2qx}}{q\,x\,(1-x)^{2q}} - x - \frac{1}{qx}. \]

A series expansion leads to

\[ f_n = \frac{1}{q\,\Gamma(2q)}\frac{\Gamma(n+1+2q)}{(n+1)!} + \frac{1}{q}\sum_{i=0}^{n} \frac{\Gamma(n-i+2q)}{\Gamma(2q)\,(n-i)!}\,\frac{(-2q)^{i+1}}{(i+1)!} \]

for all $n \ge 2$. Using the fact that $\Gamma(n+2q)/n! \sim n^{1-2p}$, we find
that

\[ f_n \sim \frac{e^{-2q}}{q\,\Gamma(2q)}\,n^{1-2p}. \]

Now we seek a solution to equation (7) with $f_0 = 1$ and $f_1 = 0$. The
generating function solves the following differential equation

\[ (1 - x - 2qx^2)\,y = (x^2 - x)\,y' + 2qx^2. \tag{9} \]

The analytical solution of equation (9) is

\[ h(x) = \frac{e^{-2qx}}{x\,(x-1)^{2q}} \int_0^x 2\,(y-1)^{2q-1}\,y^2\,(p-1)\,e^{2qy}\,dy. \]

Using Darboux' result, Theorem 4.12 in (Sedgewick and Flajolet, 1996), we
find that

\[ f_n \sim \frac{q\,e^{-2q}}{\Gamma(2q)}\,n^{1-2p} \int_0^1 2\,(1-y)^{2q-1}\,y^2\,e^{2qy}\,dy. \]

Finally, adding the two previous solutions, we find that the solution of the
original problem is equivalent to

\[ e_n \sim \begin{cases} \dfrac{e^{-2q}}{\Gamma(2q)}\left(1 - \dfrac{p}{2p-1}\right)\left(\dfrac{1}{q} + q\,I(p)\right) n^{1-2p} & \text{if } p < 1/2, \\[2ex] \dfrac{p}{2p-1} & \text{if } p > 1/2, \end{cases} \]

where we have set

\[ I(p) = 2\int_0^1 (1-y)^{2q-1}\,y^2\,e^{2qy}\,dy. \]

In the case $p = 1/2$, $e_n$ satisfies the equation

\[ (n+1)e_{n+2} = n\,e_{n+1} + e_n + \frac{1}{2}. \]

Denoting $u_n = e_{n+1} - e_n$, we obtain that $u_n \sim 1/(2n)$ and thus we
have $e_n \sim (\log n)/2$.
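The three regimes ($p < 1/2$, $p = 1/2$, $p > 1/2$) can be explored by iterating the recurrence itself. This is a minimal sketch, assuming the recurrence $(n+1)e_{n+2} = n\,e_{n+1} + 2q\,e_n + p$ with $q = 1 - p$ and $e_2 = e_3 = 1$; the function name and the sample values of $p$ and $n$ are illustrative choices.

```python
import math


def expected_communities_imperfect(p, n_max):
    """Approximate e_{n_max} under imperfect clustering at rate p.

    Iterates (n+1) e_{n+2} = n e_{n+1} + 2 q e_n + p with q = 1 - p,
    starting from e_2 = e_3 = 1.
    """
    q = 1.0 - p
    e_prev, e_curr = 1.0, 1.0  # e_2 and e_3
    for n in range(2, n_max - 1):
        e_prev, e_curr = e_curr, (n * e_curr + 2 * q * e_prev + p) / (n + 1)
    return e_curr


if __name__ == "__main__":
    # p > 1/2: e_n should approach the constant p / (2p - 1).
    print(expected_communities_imperfect(0.9, 100000), 0.9 / 0.8)

    # p = 1/2: doubling n should add about log(2)/2.
    gain = (expected_communities_imperfect(0.5, 100000)
            - expected_communities_imperfect(0.5, 50000))
    print(gain, math.log(2) / 2)

    # p < 1/2: e_n - p/(2p-1) should grow like n^(1-2p).
    p = 0.2
    shift = p / (2 * p - 1)
    g1 = expected_communities_imperfect(p, 20000) - shift
    g2 = expected_communities_imperfect(p, 40000) - shift
    print(math.log(g2 / g1) / math.log(2), 1 - 2 * p)
```

Each printed pair places the value obtained from the recurrence next to the value predicted by the asymptotic analysis.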

Community size. Assuming imperfect clustering at rate $p$, the mean
community size $s_n$ satisfies the following recursive equation

\[ s_{n+1} = \left(1 - \frac{1}{n}\right)s_n + \frac{q}{n}s_{n-1} + 2p + \frac{2q}{n}. \]

Let us denote

\[ u_n = \frac{2p}{1+p}\,n \quad \text{and} \quad f_n = s_n - u_n. \]

We remark that $u_n$ satisfies the following equation

\[ u_{n+1} = \left(1 - \frac{1}{n}\right)u_n + \frac{q}{n}u_{n-1} + 2p + \frac{2pq}{n(1+p)}. \]

Thus, we have

\[ n f_{n+1} = (n-1) f_n + q f_{n-1} + \left(2q - \frac{2pq}{1+p}\right). \]

Applying the same method as in the previous paragraph for $p > 0$, we obtain
that

\[ f_n = s_n - u_n \sim K \]

where $K$ is a constant term. Then it is routine to conclude that

\[ s_n \sim u_n = \frac{2p}{1+p}\,n, \]

which is also in good agreement with numerical values for moderate n.
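The linear growth of the mean community size can be checked by iterating the recursion. The sketch below assumes the form $s_{n+1} = (1 - 1/n)s_n + (q/n)s_{n-1} + 2p + 2q/n$ with $q = 1 - p$, which is the form consistent with the equations for $u_n$ and $f_n$ above. The starting values $s_2 = 2$ and $s_3 = 3$ are an assumption of the sketch (they do not affect the leading term), and the function name is illustrative.

```python
def mean_community_size(p, n_max, s2=2.0, s3=3.0):
    """Approximate s_{n_max} under imperfect clustering at rate p.

    Iterates s_{n+1} = (1 - 1/n) s_n + (q/n) s_{n-1} + 2p + 2q/n for n >= 3,
    with q = 1 - p. The starting values s2, s3 are assumed, not from the text.
    """
    q = 1.0 - p
    s_prev, s_curr = s2, s3
    for n in range(3, n_max):
        s_next = (1 - 1 / n) * s_curr + q / n * s_prev + 2 * p + 2 * q / n
        s_prev, s_curr = s_curr, s_next
    return s_curr


if __name__ == "__main__":
    n = 20000
    for p in (0.1, 0.3, 0.58):
        slope = mean_community_size(p, n) / n
        print(p, slope, 2 * p / (1 + p))
```

For each rate $p$, the ratio $s_n/n$ is printed next to the predicted slope $2p/(1+p)$; the gap between them shrinks roughly like $K/n$.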
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